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Abstract 

Maximum likelihood fits to data can be done using binned data (histograms) and unbinned data. 
With binned data, one gets not only the fitted parameters but also a measure of the goodness of fit. 
With unbinned data, currently, the fitted parameters are obtained but no measure of goodness of 
fit is available. This remains, to date, an unsolved problem in statistics. Using Bayes theorem and 
likelihood ratios, we provide a method by which both the fitted quantities and a measure of the 
goodness of fit are obtained for unbinned likelihood fits, as well as errors in the fitted quantities. 
We provide an ansatz for determining Bayesian a priori probabilities. 
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I. INTRODUCTION 



We outline a method by which goodness of fit measures can be calculated in an unbinned 
likelihood analysis. We are able to also calculate the probability density function of the fitted 
variables and hence their errors in a rigorous manner. We briefly describe the currently used 
method of "maximum likelihood", originally due to R.A.Fisher ffl. Let s denote a set 
parameters defining our theoretical model used to describe data. Example of s are the mass 
of the top quark or the lifetime of a particle. The symbol s (for signal) can in general denote 
a discrete or continuous set of variables. Let c denote a set of observations describing a 
high energy physics event and there are n events in our dataset. In general, for each event, 
c can be a vector of dimension d. Let P(c\s)dc describe the probability of observing the 
configuration c in the d-dimensional phase space volume dc given the theoretical parameter 
set s. Thus P{c\s) is a probability density function (pdf) in the variable c and obeys 

j P{c\s)dc = 1 (1) 

Then one can define a likelihood £ of observing the dataset as 

i=n 

C = l[P(c t \s) (2) 

8=1 

The maximum likelihood point can be found of observing by minimizing the negative log- 
likelihood —log e C defined as 

i=n 

-log e C ^ log e (P(ci\s)) (3) 
i=i 

while varying the parameters s either analytically or numerically to obtain the best values 
s* of s that fit the data. 

At the maximum likelihood point, s*, the best fit values of s, are obtained. There is 
however no measure of the goodness of fit, since the likelihood at the optimal value is not 
normalized to anything. There is strictly no measure for the error on the fitted parameters, 
since C is not a probability density function of s, though people have calculated errors 
by treating the —log e C at the minimum as though it were equivalent to \x 2 - Such error 
calculations are not hitherto considered rigorously justifiable. 

Unbinned likelihood fits, despite these disadvantages, are extremely useful in finding s* 
since one does not have to treat bins with small populations in a special manner as would 
be the case for binned fits. 
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In this paper we use Bayes theorem to rectify the above disadvantages. In the process, 
we obtain a measure for the goodness of fit and also P(s\c), the posterior pdf of s, enabling 
us to calculate the errors of the fitted values in a rigorous way. 

II. BAYES THEOREM 

We derive Bayes theorem here for the sake of completeness and to illustrate the main 
ideas. In the Bayesian approach ||, the theoretical parameters s can have a probability 
distribution both a priori and a posteriori. The a priori distribution refers to the knowledge 
of s before the given set of observations are made. The a posteriori probability distribution 
refers to the distribution of s, given the set of observations c. 

A. Joint and Conditional Probabilities 

We define a joint probability density for the theory parameters s and the observables c 

as 

dPjoint Pjointi,^ j cjds dc (4) 

which is the probability that s occurs in interval s and s + ds and c occurs in a volume 
element dc centered around c. 

We define the conditional probability density 

dPconditional = P(s\c)ds (5) 

as the probability density of observing s in the interval s and s + ds given that c occurs in 
a volume element dc centered around c . 

Similarly, the conditional probability density 

dPconditional = P{c\s)dc (6) 

is defined as the probability density of observing c in a volume element dc centered around 
c, given that s occurs between s and s + ds . Then, by the laws of probability , we can write 
the joint probability 

dPjoint = Pjoint{s, cjdsdc = P(c\s)dc x P(s)ds (7) 
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Where P(s) is the a priori probability of observing s in interval s and s + ds , and P(c\s)dc 
is the probability of observing c given s . One can also obtain the same joint probability, by 
first observing c with a priori probability P{c) and then using the conditional probability 
P(s\c) , i.e. 

dPjoint = Pjoint(s, c)dsdc = P(s\c)ds x P(c)dc (8) 

By equating |7| and ^ one gets the fundamental relation leading up to Bayes Theorem. 

dPjoint = P{c\s)dc x P(s)ds = P(s\c)ds x P(c)dc (9) 

Expressed in terms of densities, dropping ds and dc terms, this yields 

P(c\s) x P(s) = P{s\c) x P(c) (10) 

One is interested in evaluating P(s\c), the probability of the theory parameters, given a set 
of observations c. This becomes, 

PMO - (ii) 

The a priori probability P(c) is not an independent quantity, given the a priori probability 
P{s) which represents the knowledge of s before the set of observations c . The reason for 
this is that P(s\c) integrated over s must add up to unity. 

B. Some Normalization Formulae 

Integrating over one of the variables in the joint probability yields, using equation 0, the 
following relations. 

P{c) = J P loint (s,c)ds = J P(c\s) x P(s)ds (12) 

where the = sign is the definition of the a priori probability P(c), since one integrates the 
joint probability Pj int{s, c) over all values of s . This then yields 

P( c ) = J p(p\s) x P(s)ds (13) 

also, integrating the joint probability over c, one gets 

P{s) = J P Joint (s,c)dc= J P(c\s) x P{s)dc (14) 
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i.e. 



P(s) 



P(c\s) x P(s)dc 



orJp(c\s)dc = l 



(15) 
(16) 



Similarly, using equation |8], one gets relations similar to the above with c and s interchanged. 
Summarizing, one gets the following normalization relations. 



P(c) = J P(c\s) x P(s)ds 
P(s) = J P(s\c) x P(c)dc 
J P{s)ds = 1 



P(c)dc = 1 



/P(C| S ),C=1 



Substituting in [11], one gets the derivation of Bayes Theorem. 

P(c\s) x P(s) 

P(s\c) 



(17) 
(18) 
(19) 
(20) 
(21) 
(22) 



(23) 



/ P(c\s) x P(s)ds 

The above equation normalizes to unity as per equation ^2[ This is the central expression 
of Bayes' theorem. 



C. Observation of Many Configurations 

Now we come to one of the more beautiful properties of formula |23], namely it is recursive. 
Let us observe two separate configurations say, C\ and C2 . Then equation ^3] yields for c\ , 

P(s\c ) - P{Clls) X P(S) (24) 
P(S,Cl) " J P( Cl \s) x P(s)ds (24) 

Now we observe Cq. We wish to compute P(s\ci,C2) , the probability of s given c\ and ■ 

We can then replace the a priori probability for s, P{s) in equation ^3] by the probability 

of s after observing C\ (i.e. P(s\c\)) to calculate the probability of s given c\ and C2 . This 



yields, 



P(S|Cl ' C2) " J P(c 2 \s) X P(s\ Cl )ds (25) 
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Substituting for P(s\cx) from equation |24|, we get 



p( | v P(c 2 \s)P( Cl \s)P(s)/ J P( Cl \ S )P(s)ds 
{s\ Cl ,c 2) j p(c2 | s) p (ci | s) p (s)ds/ jp (ci | s)p(s)ds I °J 

P(c 2 |s)P( Cl |s)P(s) . . 



generalizing, P(,| Cl , e 2 ...c„) = j p{c ^' PMs)p{cAs)p{s)ds (28) 
Another way to think about equation ESI is to think of the n configurations as one massive 



super configuration c n , which also obeys the Bayes theorem equation 23 



P(s\c ) - P ( C "' g ) X P ^ (29) 
where P(c n |s) = P{c n \s)...P{c 2 \s)P(c 1 \s) (30) 

It should be noted that the probability P(c n |s) obeys the normalization condition E_T]. Equa- 
tion ^ is just the law of multiplication of independent probabilities. This implies that it 
is possible to chain probabilities in Bayes theorem as in equation |25| if and only if the con- 
figurations are statistically independent. This is certainly true in the case of high energy 
physics events. 

The expression for a posteriori probability P(s\c n ) in equation p9| cannot be used as is 
unless one knows P(s), the a priori probability of s. In the "Bayesian approach", people use 
various guesses for P(s) and a lot of care and energy are expended in arriving at "reasonable" 
functions for P(s). 

D. Likelihood Ratios 



We now recast the Bayes theorem equation |29] as a set of likelihood ratios Cn 



P(s\c n ) _ P(c n \s) 

^— (31) 

where we have substituted the function P(c n ) for the normalizing integral in the denominator 



using equation |H| The likelihood ratio has a very important invariant property. It is 
invariant under the transformations of variable sets c — > d and s — > s' where d and s' are 
functions of the variable sets c and s. It is possible to ask what exact variables one uses 
to form the vector c. For instance,when a jet is measured experimentally, does one use the 
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energy, pseudo-rapidity and azimuth of the jet or the three components of the energy three 
vector as components of c? Clearly, the probability density function P(c\s) will depend on 
the choice of the variable set c since, 

dc 

P(c'\s) = \^\P(c\s) (32) 

where, |-r§-| denotes a Jacobian of transformation to go from the set of variables c to d. 
However, the same Jacobian ocurs in the denominator of the C-r, hence the likelihood ratio 
is unaffected by the transformation. The same argument can be made with respect to 
transformations of the variable set s — > s'. These are extremely important properties, so we 
henceforth work with the likelihood ratio C-ji and not the likelihoods L which do not possess 
these properties. 



III. THE PRINCIPLE OF MAXIMUM LIKELIHOOD RATIOS 

The equation [31] for £^ can be expanded as follows. 

P(c n \s) P{ Cl \s) P(c 2 \s) P(c n \s) 

LU ~ P( Cn ) " P( Cl ) P(C 2 ) P( Cn ) ^ 

where we have used the independence of a priori probabilities for P(cj), i — l,n. Similarly, 
one gets expressions, 

_ P(s\c n ) _ P(s\ Cl ) P(s\c 2 ) P(s\c n ) 

where we have derived equation from equation |33] by applying equation |3l| to the likelihood 
ratios of the individual events in the product. In order to find the optimal set of parameters 
s, we maximize the likelihood ratio C-ji in equation |33| with respect to s. This is equivalent 
to minimizing the negative log likelihood ratio log e £ft. 

dlog e C n % sA dlog e P(ci\s 



J2^p^ = (35) 



ds ds 
i=i 

Notice that this is the same set of equations that one gets when maximizing the likelihood 
as in equation |3], since the a priori probabilities P(ci) are constant with respect to variations 
in s. So one gets the same set of optimal variables s* whether one maximizes the likelihood 
C or the likelihood ratio C-ji- However, at the optimum, the likelihood ratio can be used 
to obtain a goodness of fit parameter as we show below, whereas the likelihood method 



would be unable to provide this information. One can now ask what the minimum value 
of £ft is with respect to variations in the event configuration, for a fixed value of theory; 
i.e. what event configurations produce the minimum value of the negative log likelihood? 
Differentiating equation ^ with respect to q, one gets, 

dlog e C n dlog e P(ci\s) dlog e P(d 



dci dci dci 

i.e 

dlog e P{ci\s) dlog e P(c,j) 



(36) 



(37) 



dci dci 

P( Ci \s) = P(q) (38) 

The equation |38| implies that the lowest value of the likelihood ratio occurs when the ex- 
perimental probability density P(c) and the theory probability density P{c\s) are the same 
at the observed events. The negative log likelihood is zero at this point, yielding the best 
possible fit. 



IV. EVALUATING THE FUNCTION P(c) AND THE GOODNESS OF FIT 

The key point to note is that just as P(s) is the a priori probability of the theoretical 
parameter s, P{c) is the a priori probability of the data. In order to evaluate the likelihood 
ratio Ctz at the maximum likelihood point, one needs to evaluate the function P(c) at 
the observed event configurations ci,C2 ■ ■ ■ c n . So the problem to solve is this: given the 
event configurations ci,C2 ■ ■ - c n , what is their probability density? Well known methods 
exist to estimate the pdf's given discrete event distributions. These are collectively titled 
probability density estimators (PDE), which have recently found application in high energy 
physics analyses [f|. 

As noted above, the probability density function P(c) is the a priori pdf of the data. In 
previous applications, to the author's best knowledge, the function P(c) was subsumed into 
the equation [l^ and expressed in terms of an unknown P(s). This resulted in the theory pdf 
P(c\s) being evaluated at the data points Ci, c 2 . . . c n , but not the data pdf\ It is precisely 
this failure to evaluate P(c) given c that has led to the absence of goodness of fit criteria in 
unbinned likelihood fits. 

In binned likelihood fits, one fits a theoretical curve to a binned set of data points. Two 
distributions, those of theory and data, are involved in providing a goodness of fit measure 
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such as x 2 m the binned approach. In the unbinned method, however, one finds the maximum 
likelihood point by evaluating the theoretical function P(c\s) at the data points Ci,i = l,n. 
There is only one distribution involved, namely theory! One has hitherto ignored P(c), by 
subsuming it into a normalization constant. We rectify this lapse here and restore P(c) to 
its proper role, namely, the pdf of the data. 

A. Probability Density Estimators 

Let c", a — 1, d denote the components of the d-dimensional vector c for the i th event. 
Then we can define the mean vector < c a > as 



i=n 

1 

< C" > 



i=n 

~J2 C ? ( 39 ) 



n 
i=i 



The covariance (or error) matrix E of c is defined as 

E a,f3 =< c a c (3 > _ < Q a >< £ > ^ 

where the <> implies average over the n events. The Hessian matrix H is defined as the 
inverse of E. One can define a multivariate Gaussian Kernel Q(c) as 

Q{c) = — 1= f exp( 41 

where the repeated indices imply summing over and the parameter h is a "smoothing pa- 
rameter", which has[|J a suggested optimal value h m n~ l ^ d+i \ The pdf of c is then given 
by 

- i=n 

P(c)^PDE{c) = -Y,G{c-c l ) (42) 

Simply put, one takes an arbitray point c in configuration space, calculates the separation 
from this point to all the measured points and sums up the values (at c) of the Gaussians 
that are centered at the measured points. This sum is divided by the number of Gaussians, 
which equals n. Since the Gaussians are all normalized to unity, P{c) obeys equation 2H. 



One can feed in any value of c and the PDE will provide a probability density at that value 
of c. It is clear that the PDE method is generalizable to arbitrary dimensions and will 
provide us with P(c). One should note that the Gaussian Kernel function depends on n, 
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the number of events in the sample. This dependence is through the smoothing parameter, 



which goes to zero as n — > oo. In this limit, equation ^2] becomes 

P(c) = J P^G^c-c^da (43) 

This implies that 

Goo(c - Ci) = lim G(c - Cj) = 5(c - Cj) (44) 

n— >oo 

There exist generalizations [f| of the above scheme where the covariance matrix is made 
locally variable that can estimate pdf's with greater complexity albeit at a cost in computing 
speed. In what follows, we introduce a method by which the smoothing factor can be made 
a function of the configuration variables c in an iterative fashion, which may be equivalent 
to varying the covariance matrix locally. 

V. AN ILLUSTRATIVE EXAMPLE 

We illustrate the ideas discussed above with a simple one-dimensional example of events 
in which the observable c consists of decay times distributed exponentially with a decay 
constant s=1.0 in arbitrary units. The conditional probability P{c\s) defines our theoretical 
model and is given by 

P(c|s) = -exp(--) (45) 
s s 

The PDE one dimensional Gaussian kernel for this simple case would be given by 

= (vibr*"'-^?' (46) 

We generate a thousand events for which the smoothing parameter h is calculated to be 0.125 
as per equation^ h = 0.5n _1 ^ d+4 ' ) . Figure [I] shows the generated events, the theoretical 
curve P(c\s) and the PDE curve P(c) normalized to the number of events. 

The PDE curve is a poor estimator of the data near the cutoff at c=0. This is because 
the Gaussians centered at values of negative c would have contributed to the curve near c=0. 
Also, for large values of c, data are sparse and the Gaussian approximation with constant 
smoothing factor h finds it difficult to approximate the data. We choose to restrict our 
fitting to a fiducial interval in time t\ < c < t% — 1 < c < 5. Both the theoretical model 
P{c\s) and the PDE likelihood curves are renormalized so that they integrate to unity in 
the fiducial interval. 
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FIG. 1: Figure shows the histogram (with errors) of generated events. Superimposed is the theo- 
retical curve P(c\s) and the PDE estimator (solid) histogram with no errors. 

A. Iterative Determination of the Smoothing Factor 

The expression h ~ n^ 1 ^ d+A ' > clearly is meant to give a smoothing factor that decreases 
slowly with increased statistics n. It is expected to be true on average over the whole 
distribution. However, the exponential distribution under consideration has event densities 
that vary by orders of magnitude as a function of the time variable c. In order to obtain a 
function h(c) that takes into account this variation, we first work out a PDE with constant 
h and then use the number densities obtained thus [0 to obtain h(c) as per the equation 

nPDE(c) x - (Ui 



A(c) = [n^w) (47) 

The equation is motivated by the consideration that a uniform distribution of events 
between t\ and t 2 has a pdf = l/(t 2 — £1) whereas the real pdf is approximated by PDE. 
The function h(c) thus obtained is used to work out a better PDE(c). This process is 
iterated three times to give the best smoothing function. 

We generate n=1000 events in the fiducial interval. If now we were to superimpose a 
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Gaussian with 500 events centered at c=2.0 and width=0.2 on the data, the PDE estimator 
will follow the data as shown in Figure [| This shows that the PDE estimator we have is 
adequate to reproduce the data, once the smoothing parameter is made to vary with the 
number density appropriately. 



2002/06/06 12.53 

Generated events and PDE comparison 




Time (arbitrary units) 



FIG. 2: Figure shows the histogram (with errors) of 1000 events in the fiducial interval 1.0 < c < 5.0 
generated as an exponential with decay constant s=1.0. with a superimposed Gaussian of 500 
events centered at c=2.0 and width=0.2. The PDE estimator is the (solid) histogram with no 
errors. 

The smoothing function h(c) for the events in Figure |2| is shown in Figure [| It can be 
seen that the value of h increases for regions of low statistics and decreases for regions of 
high statistics. Superimposed is the constant smoothing factor obtained by the equation 
h « 0.5n -1// ( d+4 ) = 0.5n~ 0,2 , with n being the total number of events generated, including 
those outside the fiducial volume. 
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Smoothing factor as a function of time 
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FIG. 3: The variation of h as a function of c for the example shown in Figure The variation 
of the smoothing parameter is obtained iteratively as explained in the text. The flat curve is a 
smoothing factor resulting from the formula h ~ 0.5n -1 /( rf+4 ) . 

B. An Empirical Measure of the Goodness of Fit 

The negative log-likelihood ratio NCC1Z = —log e C-ji at the maximum likelihood point 
now provides a measure of the goodness of fit. In order to use this effectively, one needs 
an analytic theory of the sampling distribution of this ratio. This is difficult to arrive at, 
since this distribution is sensitive to the smoothing function used. If adequate smoothing is 
absent in the tail of the exponential, larger and broader sampling distributions of NCC1Z will 
result. One can however determine the distribution of MCCTZ empirically, by generating the 
events distributed according to the theoretical model many times and determining AfCCTZ 
at the maximum likelihood point for each such distribution. The solid histogram in figure |] 
shows the distribution of TV CCTZ for 500 such fits. This has a mean of 2.8 and an rms 
of 1.8. The dotted histogram shows the corresponding value of AfCCTZ for the constant 
value of smoothing factor shown in figure [3|. This distribution is clearly broader (rms=2.Q3) 
with a higher mean(=9.1) and thus has less discrimination power in judging the goodness 
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of fit than the solid curve. We now fit the same exponential background distribution with 

2002/06/07 14.49 




FIG. 4: The solid curve shows the distribution of the negative log likelihood ratio NCC1Z at the 
maximum likelihood point for 500 distributions, using the iterative smoothing function mechanism. 
The dashed curve shows the corresponding distribution in the case of a constant smoothing function. 



different numbers of Gaussian events superimposed on an exponential background. Table | 
shows the results of the fit. When a Gaussian of 500 events with width 0.2 and mean 2.0 
is superimposed on the exponential distribution of 1000 events, a value of N"CCTZ=189 is 
obtained while trying to fit for the exponential using the unbinned maximum likelihood 
method. This is 103<r away from the mean of the NCC1Z distribution shown in figure |] with 
the iterated smoothing function. A 3a effect is observed when the number of events in the 
Gaussian is 85. Figure || shows the generated events, the PDE and the fitted curve for this 
case. 

Let us note that it is possible to make a cumulative function from the solid histogram 
in figure |] and estimate the probability that NCC1Z exceeds the observed value, just as we 
do with x 2 tests. We have also performed a binned \ 2 fit to an exponential over the same 
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histograms, with the data in the fiducial region binned over 41 bins. The resulting value 
of x 2 f° r 39 degrees of freedom are shown in the last column in table |. At the 3a point 
for the unbinned method, the binned method yields a x 2 °f 42 over 39 degrees of freedom, 
which may be considered a good fit. This implies that the unbinned method may have more 
discriminating power against bad fits than the binned one. It is worth noting however that 
the binned fit is over two parameters (the number of events and the slope) whereas the 
unbinned fit being considered here is only over a single parameter, namely the slope. 



TABLE I: 



Number of 


Unbinned fit 


Unbinned fit 


Binned fit \ 2 


Gaussian events 


NCCK 


Na 


39 d.o.f. 


500 


189. 


103 


304 


250 


58.6 


31 


125 


100 


11.6 


4.9 


48 


85 


8.2 


3.0 


42 


75 


6.3 


1.9 


38 


50 


2.55 


-0.14 


30 





0.44 


-1.33 


24 



We now can fit the exponential data (with no superimposed Gaussian bumps) and com- 
pute the value of the likelihood ratio Cn = P p C ^ as a function of the parameter s about the 
maximum likelihood point. Figure |^ shows this function, which has the maximum value at 
s = 1.019. Note, however, that Cn-, a dimensionless quantity, is not the likelihood distribu- 
tion of s, which has to have the dimensions of 1/s. 



C. Determination of the a priori Likelihood P(s) 

In order to obtain the likelihood distribution of s, P(s\c n ), we need to understand better 
P(s), the a priori distribution of s. We are in a position to do this, since we have identified 
P(e) to be the distribution of data, and P(s) and P(c) are linked by the equation 

P( c ) = [ p( c \s) x P(s)ds (48) 
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FIG. 5: Figure shows the histogram (with errors) of 1000 events in the fiducial interval 1.0 < c < 5.0 
generated as an exponential with decay constant s=1.0. with a superimposed Gaussian of 85 events 
centered at c=2.0 and width=0.2. The PDE estimator is the (solid) histogram with no errors. 
The data are fitted with a goodness of fit that is 3er away from the average value of J\fCClZ. The 
continuous curve shows the fit to an exponential. 

We are in the process of using the a posteriori information contained in the data pdf P{c) 



to infer the a priori function P(s). Before we use equation ^ to calculate P(s), let us make 



the following observations. Using equation 31, we can write 



P( S |c n ) = P(s) x£ n = P(s) x (49) 



As we increase n, the number of events sampled, in the limit n — > oo we expect the 
a posteriori probability P(s\c n ) to tend towards the delta function 5(s — s*) where s* is the 
true value of s. This is because P(s|c n ) is the likelihood distribution of s and we expect 
to determine the true value of s with infinite precision in this limit. However, the ratio 
p£fy wm tend towards unity in this limit (for a good fit), since for each data point c&, the 
theoretical pdf P{cf.\s) and the data pdf P(ck) will be close to each other. The only way out 
of this is to allow P(s) to depend discretely on n and let the distribution P(s) — > 5(s — s*) 
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FIG. 6: Figure shows the likelihood ratio C-ji = p^) as a function of the fitted parameter s. The 
maximum likelihood point is at s = 1.019. 



as n — ► oo. We can see the need for this further, using equation |48|. In the limit n — ► oo, 
the data pdf P(c) will have the form P(c\s*), where s* is the true value of s, if it fits the 



theoretical model. Then the only solution for equation [Ig is again P(s) = 5(s — s*). 

To repeat, the only way out of this dilemma, is for the a priori probability distribution 
P(s) to be dependent on n and tend towards a delta function as n — > oo. If we are solving a 
Bayes theorem problem for n data points, then the a priori function P(s) for that problem 
will be written as P n (s) indicating that it comes from a discrete familiy of probability 
distributions that depend on n. Nothing further is known about P n (s) a priori except that 
it is a pdf in s and that the pdf depends discretely on n. 



1. Rewriting the Bayes Theorem Equations 

The Bayes theorem equations have to be re-written to take into account this change. 
Equation [H| now becomes 
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£k - = = -pm (50) 

Equation ^ remains as is and equation [34] becomes 

P(s|c n ) _ P(s|ci) P(s\c 2 ) P(s\c n ) 

Ln ' n - ~pw " ~w x ~w " ' x ~p^~ { } 

where we have also added the subscript n to the likelihood ratio C-ji to indicate its dependence 
on n. The recursive chain rule can now be rewritten as 

P(s|c k ) ^ P{Cj\s) 

1=1 



P(s Ci) rrP^ , , 

^ = flW = Sw (53) 

r -r _ ^(s|c k ) P{s\c{) _ P(s|c k+1 ) _'^f' f(cjs) 

Pfe(s) Pi{s) Pk+l{S) fj^ P{Ci) 

where we have two sub-samples of k and I events which are being combined to form a total 
number of k + I events. 



2. An Ansatz for P n (s) 

The expression for P(c) in equation |48| can be thought of as the theoretical prediction 
for P(c) and the PDE estimator is the experimental measurement of P(c). Then, one can 
write, 

P^(c n ) _ f P(c n \s) 

= J Cn,n(s) x P n (s)ds = J P(s\c n )ds = 1 (56) 

with the last expression following from Bayes theorem. There are two ways the last equation 
J P(s\c n )ds = 1 can be satisfied. Either the likelihood ratio Cn, n ( s ) = 1 or if 

Pnis) = " 2A (57) 

It is very difficult for C-ji n {s) to equal unity even at the maximum likelihood value, since 
the experimental PDE estimator in the denominator is subject to statistical fluctuations. 
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Equation |57], however, gives us an expression for the a priori likelihood P n (s). P n (s) is the 
value of the a priori probability distribution at the true value of s. Since J P n (s)ds = 1, 
we can satisfy this with a functional form for P n (s) being a step function 9(s\fi) such that 

with si = n — A (58) 

and S2 = n + A (59) 

P n (s) = 6{s\(jl) = Oif s < St or s > s 2 (60) 

P„,( S )=^( S | / i) = i-if Sl < S < S2 (61) 

and the value of fi, the mean of the distribution, is totally unknown. Let us note that it is 
possible to write down an equation such as equation only due to the fact that we are able 
to compute a dimensionless quantity such as the likelihood ratio Cnn( s ) an d this becomes 
possible only due to the concept of the PDE estimator for P(c) introduced in this paper. 
The integral in equation 57| has thus the dimensions of s giving P n (s) the dimensions of 



as required. 

As n increases, the value of A decreases, since the distribution P(c n \s) sharpens. This 
has the effect of narrowing P n (s) in accordance with the discussion above. It is important 
to realize that there is only one true value of s, and the Bayesian a priori probability P n (s) 
refers to the value of P n (s) at that true value, which, according to our ansatz, equals ^r. 
The data does not result from an admixture of probable values of s, but from a single true 
value of s. So Bayes theorem becomes, 



yielding 



P(s|c n ) x P(c n ) = P(c n \s) x P n (s) = P(c n \s) x ^- (62) 



P(c\ c ) - P ( c "l fi ) x J_ - £ k(s) _ P(c n \s) 
[ln) P(c n ) 2A jC n {s)ds JP(c n \s)ds [M) 



The last equation in ^3] results from the fact that the PDE estimator for P(c) cancels both 
in the numerator and denominator. Having obtained P(s\c n ), one can proceed to calculate 
the statistical quantities associated with s, namely the mean, mode, median, variance, errors 
and limits, in a rigorous fashion. We note here that P(s\c n ) is obtainable only with the use of 
of Bayes theorem, and our ansatz for the Bayesian a priori likelihood P(s). The expression 
for P(s|c n ), the a posteriori likelihood for s does not depend on the PDE estimator of data, 
but only on the theoretical function P(c n \s) evaluated at the data points. The evaluation 
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of P n (s) and the goodness of fit criteria both require the usage of the PDE estimator for 
the data pdf. 

The ansatz for the a priori distribution for P n (s) assumes a flat distribution in P n (s). 
This flatness may not be invariant under change of variables and the consequences of this 
needs further investigation. It is important to stress again that P n (s) in the Bayes theorem 
equations is the value of the a priori distribution, at the true value of s. The value of 
the function at the unknown true value of s is known to some statistical precision (= -r). 
We then use this to calculate the a posteriori distribution P(s|c n ) which gives information 
about the true value of s to some sttistical precision. Since we use the value of the function 
P n (s) at the true value of s only, we may not be sensitive to the shape of P n (s). Let us also 
note that we do not use the a priori distribution explicitly for any calculations, since the 
information about the error of s is contained in the normalized C-ji(s). Combining data from 



different datasets may be done by multiplying likelihood ratios as shown in equation 54 



We note in passing that the the values of A are not large enough to span the width of 
the likelihood distribution. Figure |7| shows the correlation between A and the ratio (3a/ A), 
where a is the rms of the likelihood ratio distribution, for 500 configurations c n of 1000 
events per configuration. At no point does the ratio fall below unity, indicating that the 
likelihood curve is always broader than the step function 6(s\fi). We may not blindly use the 
step function as the a priori distribution, centered at the maximum likelihood value and do 
the integral in equation |5B], since it will chop off the likelihood ratio curve in the tails. The 
step function can only be used after we feed it with a mean value fi. The function in the 
integral in equation ^ is in fact a constant which equals the value of P n {s) at the true value 
of s. This value does not change as we change s in the integral in equation |56|. It is possible 
that the true value of s is at the maximum likelihood point. It is also possible that it is at a 
value 3a away, albeit with a reduced probability. The key point is that the true value of s is 
either at the maximum likelihood point or at any of the other values at which the likelihood 
is non-zero. They do not simultaneously have to be true. Hence we can integrate over the 
whole likelihood distribution with P n (s) = ^- without worrying about falling off the edge 
of the step function. As one varies s, one is testing mutually exclusive hypotheses that the 
value of s under consideration is the true value of s. 

It is still instructive to see what happens when one supplies a distribution for the mean 
value of the step function. The following section deals with the self-consistency of our 
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FIG. 7: Figure shows a scatter plot of A, half the integral under the likelihood curve vs. 3<r /A, 
where a is the width of the likelihood distribution for 500 configurations. 

expressions, when one feeds in the a posteriori distribution P(s\c n ) for the mean value \i of 
the step function. 



3. The Bootstrap 

If the mean value /x of the step function distribution has a probability distribution -P(/i), 
then one can write an expression for the joint probability density of fi and s as 

P(/i) x 6(s\fi)dfids = ^^-dfids (64) 

2A 

inside the shaded region in figure |] and is zero outside. Integrating the above equation 
along the s axis first (fixed /i), followed by integration along the /i axis yields 

/ P(/i) xdfi 6{s\n)ds = 1 (65) 

J fl J s 

We can now reverse the order of integraton, doing the /i integration first, which yields 

-ds j 



— ds I P(p)dfi = 1 (66) 
2A 
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FIG. 8: The abscissa shows the variable s and the ordinate the variable jj,, the mean value of the 
9 function distribution. The hatched region shows the area over which the probability distribution 
for s is non zero as a function of fj,. 

This can be re-written as 

J^g(s)ds = l (67) 

where 

g(s)= F P(ji)dn (68) 

Js-X 

These equations are true for any P(fi). We need to supply a P(fi), which is the probability 
distribution of the mean value of the 9 function. The obvious candidate for P(/i) is clearly 
P(s\c n ), the a posteriori distribution for the true value of s. Since J g(s)ds = 2A, g(s) can 
be identified with C-r,(s), yielding 

/s+X 
P(s\c n )ds (69) 

For small A, we can Taylor expand the above integral yielding, 

Cn{s) w 2A x P(s|c n ) (70) 
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yielding the desired result 



p (s |c„) „ S^L = ^f) (71) 

2A J L,Ti{sjds 



Also, from equation |67|, we can identify P(s\c n ) = 4f^, since ^ is the projection of the 



probability density for s in figure [| along the s axis, computed after knowing c n . This again 
yields equation [71] and completes the bootstrap. 
To summarize the arguments so far, 

• We have made the ansatz that the a priori distribution for P n (s) is j c ^ s ^ ds ■ Such an 
ansatz gives us a distribution P n (s) whose mean value is unknown but whose width 
decreases with increased statistics. Both these properties qualify it as a candidate for 
the a priori distribution. This step requires a dimensionless C-ji and is only possible 
by the use of the experimental PDE's for the goodness of fit test, introduced in this 
paper. 

• We then supply it with a probability distribution for the mean value, which is only 
known after we have analyzed c n . The candidate for the probability distribution for 
the mean fx is P(s\c n ), which is the a posteriori distribution for the true value of s, and 
is the object of our quest. This is then used to calculate the probabilty distribution 
of s. 



This yields the expression for P(s\c n ) of equation [FT] as well as a probability density 



a posteriori for s that is consistent with the same equation. 
This results in 

J L n ^ n (s)ds J P{c n \s)ds 

for the a posteriori likelihood for s. 

In multi-dimensional parameter space, with a being the dimension of the parameter 
vector s. the above equations are generalized as follows 

1 1 

Pn(s) = JC n ,n(s)ds = (?3) 
with integrals over s being carried out over a dimensions. 
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VI. TOWARDS AN ANALYTIC THEORY OF UNBINNED LIKELIHOOD 
GOODNESS OF FIT 



Because the likelihood ratio C-ji(s) is invariant under transformation c — > c , one can use 
variables c such that 

c(c) = [ C p(c"\s)dc (74) 

This leads to probability distributions P(c'|s) such that 

P(c|s)=P(c|s) x |-^-| = l (75) 
etc 

and with the limits of the variable c being < c < 1. These sets of transformations in 
multi-dimensions is known as the hypercube transformation. The number density is constant 
in the hypercube which implies that we are not sensitive to systematics associated with the 
smoothing parameter. 

The theoretical curve is a constant =1 in this scheme. The experimental PDE will also 
be close to being flat. The question to answer is " What is the distribution of the negative 
log likelihood ratio MCCTZ that results from the statistical fluctuation of the PDE in the 
hypercube" ? We leave this question to a subsequent paper. 



VII. CONCLUSIONS 



We have introduced a technique for estimating goodness of fit in unbinned likelihood fits 
by the use of probability density estimators to obtain the a priori likelihood distribution of 
the data. In addition to providing a measure of the goodness of fit in unbinned likelihood fits 
for the first time, this approach enables us to obtain expressions for the a priori likelihood 
distribution of the theoretical parameters and hence to derive expressions for the a posteriori 
likelihood distributions of the theoretical parameters. We have shown that the a priori 
likelihood of the theoretical parameters depends on the number n of events being employed in 
the problem. We have emphasized that the a priori likelihood is the value of the probability 
distribution at the true value of s and this does not change as we change s, a posteriori, to 
calculate the likelihood that s is the true value. 

The approach outlined in this paper permits the rigorous calculation of errors in the 
fitted quantites. It makes unnecessary the practice of "guessing" the a priori likelihood 
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distributions of parameters, a practice titled "Bayesianism" . For the type of problems 
considered here, the a priori likelihood distributions can be computed. 

The techniques detailed here are extensible to arbitrary dimensions, even though we have 
used a one-dimensional problem for illustrative purposes. In the process of using probability 
density estimators, we have developed an algorithm for iteratively improving the smoothing 
parameter function of local number density. 
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